Add the SuperVectorizer and dirty_cat's encoders to the search space #169

Open

wants to merge 8 commits into base: typed_data_terminals
Conversation

@LilianBoulard commented Aug 25, 2022

This PR adds dirty_cat's encoders (currently SimilarityEncoder, GapEncoder, and MinHashEncoder) to GAMA's search space via the SuperVectorizer.

Adding the dirty_cat encoders enables GAMA to handle dirty categorical features in tabular data.

The SuperVectorizer provides a simplified interface to scikit-learn's ColumnTransformer and makes it possible to mix and match different encoding techniques.

This PR depends on features introduced in dirty_cat 0.3, which, at the time of writing (August 2022), has not been released yet.
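
As a rough illustration of the idea (not code from this PR), the sketch below shows how a SuperVectorizer could route high-cardinality categorical columns to one of the dirty_cat encoders inside a plain scikit-learn pipeline. The parameter names follow my reading of the dirty_cat 0.3 API, and the threshold and component values are arbitrary placeholders, so treat them as assumptions.

```python
# Minimal sketch, assuming the dirty_cat 0.3 API: high-cardinality string
# columns go through MinHashEncoder, everything else uses the
# SuperVectorizer's defaults. Values are placeholders, not this PR's settings.
from dirty_cat import SuperVectorizer, MinHashEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

vectorizer = SuperVectorizer(
    cardinality_threshold=40,  # placeholder: columns above this are treated as high-cardinality
    high_card_cat_transformer=MinHashEncoder(n_components=30),  # SimilarityEncoder() or GapEncoder() would also fit here
)

model = make_pipeline(vectorizer, LogisticRegression())
# model.fit(X, y)  # X: a pandas DataFrame with dirty categorical columns
```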

TODO:

  • wait for dirty_cat 0.3 to be out
  • fine-tune the preprocessing search space (a hypothetical sketch of what such an entry could look like follows this list)
  • benchmark GAMA to compare the performance before and after the introduction of the SuperVectorizer
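
To make the search-space item above more concrete, here is a purely hypothetical sketch of what such entries could look like, assuming a dict-style configuration that maps estimator classes to lists of candidate hyperparameter values. GAMA's actual configuration format (in particular on the typed_data_terminals branch) may differ, and the value ranges are placeholders.

```python
# Hypothetical search-space entries, not the configuration from this PR.
# Assumes a class -> {hyperparameter: candidate values} mapping.
from dirty_cat import SuperVectorizer, SimilarityEncoder, GapEncoder, MinHashEncoder

preprocessing_config = {
    SuperVectorizer: {
        "cardinality_threshold": [20, 40, 60],  # placeholder thresholds for "high cardinality"
        "high_card_cat_transformer": [
            SimilarityEncoder(),
            GapEncoder(n_components=30),
            MinHashEncoder(n_components=30),
        ],
    },
}
```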

@PGijsbers (Member)

Please give me a ping here as soon as dirty cat 0.3 is released :)

@LilianBoulard (Author)

Hi Pieter, dirty_cat 0.3 is out!

@PGijsbers (Member)

I have allowed CI to run; I'll try to take a closer look over this week and the next. I will probably do the 22.0.0 release without this PR (I was planning to do that today or tomorrow, since the current PyPI package is broken due to updated dependencies), so ignore the message about adding things to the changelog; I'll handle that later when preparing 22.1.0.

@PGijsbers (Member) commented Sep 14, 2022

Ah, it looks like the unit tests that use pre-defined individuals are now broken (as expected). I am not entirely sure how I want to fix that: it depends on whether we want to keep the old behavior available as an alternative, which in turn depends on a small benchmark. So I don't think there's much you can do right now to improve the tests or code.

Running some additional experiments to define a sensible default search space, as noted in the OP, should be possible and is appreciated :)
